To go along with:

- *Modern Data Science with R*, 3rd edition, by Baumer, Kaplan, and Horton
- *Introduction to Statistical Learning with Applications in R* by James, Witten, Hastie, and Tibshirani
| year | Algeria | Brazil | Colombia |
|---|---|---|---|
| 2000 | 7 | 12 | 16 |
| 2001 | 9 | 14 | 18 |
| country | Y2000 | Y2001 |
|---|---|---|
| Algeria | 7 | 9 |
| Brazil | 12 | 14 |
| Colombia | 16 | 18 |
| country | year | value |
|---|---|---|
| Algeria | 2000 | 7 |
| Algeria | 2001 | 9 |
| Brazil | 2000 | 12 |
| Brazil | 2001 | 14 |
| Colombia | 2000 | 16 |
| Colombia | 2001 | 18 |
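The three tables above show the same values in two wide layouts and one long (tidy) layout. As a sketch of how the second table becomes the third, here is a `pivot_longer()` call; the `gdp_wide` name and the `tribble()` reconstruction are mine, not from the original:

```r
library(dplyr)
library(tidyr)

# Hypothetical reconstruction of the second (country-by-year) table
gdp_wide <- tribble(
  ~country,   ~Y2000, ~Y2001,
  "Algeria",       7,      9,
  "Brazil",       12,     14,
  "Colombia",     16,     18
)

# Reshape to the long form shown in the third table
gdp_wide |>
  pivot_longer(cols = -country,
               names_to = "year",
               names_prefix = "Y",
               values_to = "value") |>
  mutate(year = as.integer(year))
```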
#(a)
babynames |>
  group_by(year, sex) |>
  summarize(totalBirths = sum(num))

#(b)
group_by(babynames, year, sex) |>
  summarize(totalBirths = sum(num))

#(c)
group_by(babynames, year, sex) |>
  summarize(totalBirths = mean(num))

#(d)
temp <- group_by(babynames, year, sex)
summarize(temp, totalBirths = sum(num))

#(e)
summarize(group_by(babynames, year, sex),
          totalBirths = sum(num))

- `filter()`, `arrange()`, `select()`, `mutate()`, `group_by()`
- `(year, sex)`, `(year, name)`, `(year, num)`, `(sex, name)`, `(sex, num)`
- `n_distinct(name)`, `n_distinct(n)`, `sum(name)`, `sum(num)`, `mean(num)`

babynames <- babynames::babynames |>
  rename(num = n)
babynames |>
  filter(name %in% c("Jane", "Mary")) |>  # just the Janes and Marys
  group_by(name, year) |>                 # for each year for each name
  summarize(total = sum(num))

# A tibble: 276 × 3
# Groups: name [2]
name year total
<chr> <dbl> <int>
1 Jane 1880 215
2 Jane 1881 216
3 Jane 1882 254
4 Jane 1883 247
5 Jane 1884 295
6 Jane 1885 330
7 Jane 1886 306
8 Jane 1887 288
9 Jane 1888 446
10 Jane 1889 374
# ℹ 266 more rows
babynames |>
  filter(name %in% c("Jane", "Mary")) |>
  group_by(name, year) |>
  summarize(number = sum(num))

# A tibble: 276 × 3
# Groups: name [2]
name year number
<chr> <dbl> <int>
1 Jane 1880 215
2 Jane 1881 216
3 Jane 1882 254
4 Jane 1883 247
5 Jane 1884 295
6 Jane 1885 330
7 Jane 1886 306
8 Jane 1887 288
9 Jane 1888 446
10 Jane 1889 374
# ℹ 266 more rows
babynames |>
  filter(name %in% c("Jane", "Mary")) |>
  group_by(name, year) |>
  summarize(n_distinct(name))

# A tibble: 276 × 3
# Groups: name [2]
name year `n_distinct(name)`
<chr> <dbl> <int>
1 Jane 1880 1
2 Jane 1881 1
3 Jane 1882 1
4 Jane 1883 1
5 Jane 1884 1
6 Jane 1885 1
7 Jane 1886 1
8 Jane 1887 1
9 Jane 1888 1
10 Jane 1889 1
# ℹ 266 more rows
babynames |>
  filter(name %in% c("Jane", "Mary")) |>
  group_by(name, year) |>
  summarize(n_distinct(num))

# A tibble: 276 × 3
# Groups: name [2]
name year `n_distinct(num)`
<chr> <dbl> <int>
1 Jane 1880 1
2 Jane 1881 1
3 Jane 1882 1
4 Jane 1883 1
5 Jane 1884 1
6 Jane 1885 1
7 Jane 1886 1
8 Jane 1887 1
9 Jane 1888 1
10 Jane 1889 1
# ℹ 266 more rows
Error in `summarize()`:
ℹ In argument: `sum(name)`.
ℹ In group 1: `name = "Jane"` and `year = 1880`.
Caused by error in `base::sum()`:
! invalid 'type' (character) of argument
# A tibble: 276 × 3
# Groups: name [2]
name year `mean(num)`
<chr> <dbl> <dbl>
1 Jane 1880 215
2 Jane 1881 216
3 Jane 1882 254
4 Jane 1883 247
5 Jane 1884 295
6 Jane 1885 330
7 Jane 1886 306
8 Jane 1887 288
9 Jane 1888 446
10 Jane 1889 374
# ℹ 266 more rows
# A tibble: 276 × 3
# Groups: name [2]
name year `median(num)`
<chr> <dbl> <dbl>
1 Jane 1880 215
2 Jane 1881 216
3 Jane 1882 254
4 Jane 1883 247
5 Jane 1884 295
6 Jane 1885 330
7 Jane 1886 306
8 Jane 1887 288
9 Jane 1888 446
10 Jane 1889 374
# ℹ 266 more rows
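The two tibbles above (with columns `mean(num)` and `median(num)`) are consistent with summaries like the following. This is a hedged reconstruction, since the code chunks that produced them are not shown in the original:

```r
babynames |>
  filter(name %in% c("Jane", "Mary")) |>
  group_by(name, year) |>
  summarize(mean(num))

babynames |>
  filter(name %in% c("Jane", "Mary")) |>
  group_by(name, year) |>
  summarize(median(num))
```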
- `gdp`, `year`, `gdpval`, `country`, `-country`
- `gdp`, `year`, `gdpval`, `country`, `-country`
- `gdp`, `year`, `gdpval`, `country`, `-country`

`ggplot()` code. Which data frame should you use?
- `pivot_wider()` on raw data
- `pivot_longer()` on raw data

# A tibble: 18 × 11
Subject day_0 day_1 day_2 day_3 day_4 day_5 day_6 day_7 day_8 day_9
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 308 250. 259. 251. 321. 357. 415. 382. 290. 431. 466.
2 309 223. 205. 203. 205. 208. 216. 214. 218. 224. 237.
3 310 199. 194. 234. 233. 229. 220. 235. 256. 261. 248.
4 330 322. 300. 284. 285. 286. 298. 280. 318. 305. 354.
5 331 288. 285 302. 320. 316. 293. 290. 335. 294. 372.
6 332 235. 243. 273. 310. 317. 310 454. 347. 330. 254.
7 333 284. 290. 277. 300. 297. 338. 332. 349. 333. 362.
8 334 265. 276. 243. 255. 279. 284. 306. 332. 336. 377.
9 335 242. 274. 254. 271. 251. 255. 245. 235. 236. 237.
10 337 312. 314. 292. 346. 366. 392. 404. 417. 456. 459.
11 349 236. 230. 239. 255. 251. 270. 282. 308. 336. 352.
12 350 256. 243. 256. 256. 269. 330. 379. 363. 394. 389.
13 351 251. 300. 270. 281. 272. 305. 288. 267. 322. 348.
14 352 222. 298. 327. 347. 349. 353. 354. 360. 376. 389.
15 369 272. 268. 257. 278. 315. 317. 298. 348. 340. 367.
16 370 225. 235. 239. 240. 268. 344. 281. 348. 365. 372.
17 371 270. 272. 278. 282. 279. 285. 259. 305. 351. 369.
18 372 269. 273. 298. 311. 287. 330. 334. 343. 369. 364.
sleep_long <- sleep_wide |>
  pivot_longer(cols = -Subject,
               names_to = "day",
               names_prefix = "day_",
               values_to = "reaction_time")
sleep_long

# A tibble: 180 × 3
Subject day reaction_time
<dbl> <chr> <dbl>
1 308 0 250.
2 308 1 259.
3 308 2 251.
4 308 3 321.
5 308 4 357.
6 308 5 415.
7 308 6 382.
8 308 7 290.
9 308 8 431.
10 308 9 466.
# ℹ 170 more rows
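One reason to prefer the long `sleep_long` form is that it plugs directly into `ggplot()`. A minimal sketch (the axis labels are my guesses at the study variables, not from the original):

```r
library(ggplot2)

# One line per subject: reaction time across days of sleep restriction
ggplot(sleep_long,
       aes(x = as.integer(day), y = reaction_time,
           group = Subject, color = factor(Subject))) +
  geom_line(show.legend = FALSE) +
  labs(x = "Day", y = "Reaction time")
```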
`right_join()`?
`right_join()`?
`name`
`band`
`plays`
`plays` variable in a `full_join()`?
`NA`
`NULL`

`addTen()` function. The following output is a result of which `map_*()` call?

- `map(c(1,4,7), addTen)`
- `map_dbl(c(1,4,7), addTen)`
- `map_chr(c(1,4,7), addTen)`
- `map_lgl(c(1,4,7), addTen)`

[1] "11.000000" "14.000000" "17.000000"
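A minimal sketch of why quoted output points at `map_chr()`; the `addTen()` definition here is an assumption, since it is not shown in the original:

```r
library(purrr)

addTen <- function(x) x + 10  # assumed definition

# map_chr() coerces each numeric result to a character string,
# which is why the printed values appear in quotes.
map_chr(c(1, 4, 7), addTen)
```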
- `map(c(1, 4, 7), addTen)`
- `map(list(1, 4, 7), addTen)`
- `map(data.frame(a = 1, b = 4, c = 7), addTen)`

- `map(c(1, 4, 7), addTen)`
- `map(c(1, 4, 7), ~addTen(.x))`
- `map(c(1, 4, 7), ~addTen)`

- `map(c(1, 4, 7), function(hi) (hi + 10))`
- `map(c(1, 4, 7), ~(.x + 10))`

The `ifelse()` function takes the arguments:

The `set.seed()` function

`c(4, 10, 8, 1, 2, 4)`
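If the question above asks which vectors could result from resampling `c(4, 10, 8, 1, 2, 4)` with replacement (as the answer key suggests), a sketch:

```r
set.seed(68)  # the seed value here is arbitrary; set.seed() makes the draw reproducible
x <- c(4, 10, 8, 1, 2, 4)
sample(x, size = length(x), replace = TRUE)
# A resample can contain only values that appear in x,
# so a result containing 3 is impossible.
```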
- `c(4, 4, 4, 4, 4, 4)`
- `c(4, 10, 8, 1, 2, 4)`
- `c(1, 2, 2, 4, 4, 2)`
- `c(10, 8, 1, 1, 8, 10)`
- `c(1, 2, 4, 3, 4, 10)`

The `kknn` method can use any distance measure.

The `k` in k-NN refers to
- `k` groups
- `k` partitions
- `k` neighbors

The `V` in V-fold CV refers to
- `V` groups
- `V` partitions
- `V` neighbors
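As a side note on distance measures for k-NN, the Minkowski distance generalizes both Euclidean and Manhattan distance. A small sketch (the `minkowski()` helper is mine, not a library function):

```r
# Minkowski distance of order p between two numeric vectors
minkowski <- function(a, b, p) sum(abs(a - b)^p)^(1 / p)

x <- c(0, 0)
y <- c(3, 4)
minkowski(x, y, p = 2)  # Euclidean: 5
minkowski(x, y, p = 1)  # Manhattan: 7
```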
Suppose you gave the correct answer to the previous question. What do you think is actually happening?
You increase the complexity (the degree of the polynomial kernel). What will happen?
104. Suppose you are using an SVM with a polynomial kernel of degree 2. You have applied it to your data and found that it fits the data perfectly; that is, training and testing accuracy are both 100%.
In the previous question, after increasing the complexity, you found that the training accuracy was still 100%. What do you think is the reason for that?
106. Suppose you have trained an SVM classifier with a Gaussian kernel, and it learned the following decision boundary on the training set:
You suspect that the SVM is underfitting your dataset. Should you try increasing or decreasing C? Increasing or decreasing gamma?
When you measure the SVM’s performance on a cross-validation set, it does poorly. Should you try increasing or decreasing C? Increasing or decreasing gamma?
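To experiment with the roles of `cost` (C) and `gamma` discussed above, one can fit radial-kernel SVMs at different settings. A sketch using the `e1071` package; the simulated data are mine, not from the original:

```r
library(e1071)

set.seed(1)
dat <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
dat$y <- factor(ifelse(dat$x1^2 + dat$x2^2 > 1.5, "out", "in"))

# Large cost and gamma: very flexible boundary (risk of overfitting)
fit_flex <- svm(y ~ ., data = dat, kernel = "radial", cost = 100, gamma = 10)

# Small cost and gamma: smooth boundary (risk of underfitting)
fit_smooth <- svm(y ~ ., data = dat, kernel = "radial", cost = 0.1, gamma = 0.1)

mean(predict(fit_flex, dat) == dat$y)    # training accuracy, flexible fit
mean(predict(fit_smooth, dat) == dat$y)  # training accuracy, smooth fit
```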
TRUE
FALSE
True
False
Objects in each cluster tend to be similar to each other and dissimilar to objects in the other clusters.
Cluster analysis is a type of unsupervised learning.
Groups or clusters are suggested by the data, not defined a priori.
Cluster analysis is a technique for analyzing data when the response variable is categorical and the predictor variables are continuous in nature.
- dendrogram
- scatterplot
- scree plot
- segment plot
Clusters are formed by dividing this cluster into smaller and smaller clusters.
TRUE
FALSE
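The dendrogram mentioned above is what base R draws when you `plot()` an `hclust` object. A minimal sketch on built-in data (the variable choices are mine):

```r
# Hierarchical clustering of cars on three standardized variables
d  <- dist(scale(mtcars[, c("mpg", "hp", "wt")]))
hc <- hclust(d, method = "complete")
plot(hc)  # draws a dendrogram
```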
116. k-means is a clustering procedure referred to as ________.
TRUE
FALSE
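A minimal k-means sketch on built-in data (the dataset and settings are my choices, not from the original):

```r
set.seed(2024)
km <- kmeans(scale(iris[, 1:4]), centers = 3, nstart = 20)

# Compare the data-driven clusters with the known species labels
table(cluster = km$cluster, species = iris$Species)
```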
wherever you are, make sure you are communicating with me when you have questions!
no right answer here!
Yes! All the responses are reasons to make a figure.
`aes()` function
`wday`.
`aes()` function
answers may vary. I’d say c. putting the work in context. Others might say b. facilitating comparison or d. simplifying the story. However, I don’t think a correct answer is a. making the data stand out.
`mean()` (average) instead of the `sum()`. The other commands compute the total number of births broken down by year and sex.
`filter()`
`(year, name)`
`sum(num)`
running the different code chunks with relevant output.
`-country`
`year`
`gdpval` (if possible, it is a good idea to name variables something different from the name of the data frame)
`pivot_longer()` on raw data. The reference for the study is: Gregory Belenky, Nancy J. Wesensten, David R. Thorne, Maria L. Thomas, Helen C. Sing, Daniel P. Redmond, Michael B. Russo and Thomas J. Balkin (2003) Patterns of performance degradation and restoration during sleep restriction and subsequent recovery: a sleep dose-response study. Journal of Sleep Research 12, 1–12.
`NA` (it would be `NULL` in SQL)
`map_chr(c(1,4,7), addTen)` because the output is in quotes; the values are strings, not numbers.
The `map()` function allows vectors, lists, and data frames as input.
`map(c(1, 4, 7), ~addTen)`. The `~` acts on functions that do not have their own name or that are defined by `function(...)`. By adding the argument `(.x)` we’ve expanded the `addTen()` function, and so it needs a `~`. The `addTen()` function all alone does not use a `~`.
we always need d. random sampling / random allocation for appropriate conclusions. The theory is derived from b. normal data. If c. \(n \geq 30\), then the theory holds really well, regardless of whether the data are normal.
`c(1, 2, 4, 3, 4, 10)` because there is no 3 in the original dataset.
`p`. When `p = 2`, Minkowski is the same as Euclidean.
`k` neighbors
`V` partitions